Abstract

The distribution functions of the codon usage probabilities, computed over all the available
GenBank data, for 40 eukaryotic biological species and 5 chloroplasts, do not follow a Zipf law,
but are best fitted by the sum of a constant, an exponential and a linear function in the rank of
usage. For mitochondria the analysis is not conclusive. A quantum-mechanics-inspired model
is proposed to describe the observed behaviour. These functions are characterized by parameters
that strongly depend on the total GC content of the coding regions of biological species. It is
predicted that the codon usage is the same in all exonic genes with the same GC content. The
Shannon entropy for codons, also strongly depending on the exonic GC content, is computed.
1 Introduction



In the recent past, some interest has been shown in applying methods of statistical linguistics and
information theory to the analysis of DNA sequences, in particular in investigating whether the
frequency distribution of nucleotides or sequences of nucleotides follows Zipf's law and in using the
Shannon entropy to identify the redundancy or the bias of a nucleotide sequence. Let us recall that,
at the end of the forties, Zipf remarked that, in natural languages and in many other domains, the
distribution function follows an inverse power law, which can be described, denoting by rank $n = 1$
the most used word, by $n = 2$ the next one and so on, and with $a > 0$, by

$$f(n) = \frac{f(1)}{n^{a}} \qquad (1)$$



In 1992, it was shown [2] that the distribution of the nucleotides in DNA follows a Zipf law of the
type given by eq. (1). However, the opinions on this statement are divided, see [3]. Soon after
[4, 5], it was shown that noncoding sequences of DNA are more similar to natural languages than
the coding ones, and the Shannon entropy has been used to quantify the redundancy of words. This
work also raised a debate in the literature [6]. Also, Zipf's law seems well adapted to represent the
abundance of expressed genes, with an exponent $a \approx 1$, as can be seen in the recent work of ref. [7].
The origin of Zipf's law is thought to arise from stochastic processes [8], especially when they can be
modelled as random walks in log scale [9]; for further analysis, see [10]. However, an analysis of the
rank distribution for codons, performed in many genes for several biological species, led the authors
of [12] to fit experimental data with an exponential function. In particular, by considering separately
different coding DNA sequences, they studied the relation between the parameter in the exponential,
the frequency of the rank one and the length of the sequence for different genes. From this very short
overview, it follows that determining the kind of law the codon rank distribution follows is
extremely interesting for investigating the nature of the evolutionary process which has acted
upon the codon distribution, i.e. the eventual presence of a bias.
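As an illustration of how a Zipf law can be tested on rank-ordered frequencies, the following minimal sketch estimates the exponent $a$ from the slope of $\log f$ versus $\log n$. It assumes the usual power-law form $f(n) = f(1)/n^a$; the function name `fit_zipf_exponent` and the synthetic data are hypothetical and are not taken from the analysis reported here.

```python
import math

def fit_zipf_exponent(frequencies):
    """Estimate the Zipf exponent a from a least-squares line in log-log scale.

    The frequencies are sorted in decreasing order, so rank 1 is the most used item.
    All frequencies must be strictly positive.
    """
    ranked = sorted(frequencies, reverse=True)
    xs = [math.log(n) for n in range(1, len(ranked) + 1)]
    ys = [math.log(f) for f in ranked]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    return -slope  # f(n) ~ n^(-a), so the exponent is minus the slope

# Synthetic check (hypothetical data): an exact power law with a = 1.2.
synthetic = [0.3 / n ** 1.2 for n in range(1, 65)]
exponent = fit_zipf_exponent(synthetic)
```

A rank distribution that is exponential rather than a power law shows up as a clearly curved line in log-log scale, so the residuals of this fit are as informative as the exponent itself.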

In the last few years, the amount of available data for coding sequences has considerably increased,
but apparently no analysis using the whole set of data has been performed. The aim of this paper is
to perform such a study. As the result of our investigation, we will point out that the rank of codon
usage probabilities follows a universal law, the frequency function of the rank-n codon showing up as
a sum of an exponential part and a linear part. A quantum-mechanics-inspired model is proposed,
which provides the form of the observed law and gives the correct sign of the different terms.
Such a universal behaviour suggests the presence of general biases, one of which is identified with the
total exonic GC content. Indeed, the values of the parameters appearing in the fitting expression are
plotted versus the total percentage of exonic GC content of the biological species and are reasonably
well fitted by a parabola. Finally, from the obtained expression, we derive the theoretical prediction
that the usage probability for rank-ordered codons is the same in any genic region having the same
exonic GC content for any biological species.

We compute the Shannon entropy [13] for amino-acids and find that its behaviour as a function of
the exonic GC content is also a parabola, whose apex is around the value 0.50 of the GC content.



2 Codon usage probabilities distribution 



Let us define the usage probability for the codon XZN (X, Z, N ∈ {A, C, G, U}) as

$$P(XZN) = \lim_{N_{tot} \to \infty} \frac{n_{XZN}}{N_{tot}} \qquad (2)$$

where $n_{XZN}$ is the number of times the codon XZN has been used in the analysed biosynthesis
process for a given biological species and $N_{tot}$ is the total number of codons used in all considered
processes. It follows that our analysis and predictions hold for biological species with sufficiently
large statistics of codons. For each biological species, codons are ordered following the decreasing
order of the values of their usage probabilities, i.e. codon number 1 corresponds to the highest value,
codon number 2 is the next highest, and so on. We denote by $f(n)$ the probability $P(XZN)$ of
finding that the codon XZN is in the n-th position. Of course the same codon occupies in general two
different positions in the rank distribution function for two different species. We plot $f(n)$ versus the rank
and we determine that the best fit to the data can be reached by the sum of an exponential function,
a linear function in the rank and a constant, i.e.

$$f(n) = \alpha\, e^{-\eta n} - \beta\, n + \gamma , \qquad (3)$$

where $0.0187 < \alpha < 0.0570$, $0.050 < \eta < 0.136$, $0.82 \times 10^{-4} < \beta < 3.63 \times 10^{-4}$, $\gamma = 0.016$ are constants
depending on the biological species. These four constants have to satisfy the normalization condition

$$\sum_{n} f(n) = 1 . \qquad (4)$$
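The fitted law and its normalization can be checked numerically. The sketch below evaluates the sum of an exponential, a linear term and a constant for the Homo sapiens parameters of Table 1 (the column listed as $10^4 \beta$ is converted to $\beta$) and sums it over the 64 ranks. The function names are hypothetical, and the rounded table values are only expected to give a total close to 1, not exactly 1.

```python
import math

def codon_rank_probability(n, alpha, eta, beta, gamma=0.016):
    """Rank distribution f(n) = alpha * exp(-eta * n) - beta * n + gamma."""
    return alpha * math.exp(-eta * n) - beta * n + gamma

def total_probability(params, n_codons=64):
    """Sum of f(n) over the ranks 1..n_codons (normalization check)."""
    return sum(codon_rank_probability(n, **params) for n in range(1, n_codons + 1))

# Table 1, Homo sapiens: alpha = 0.0214, eta = 0.073, 10^4 * beta = 1.65.
HOMO_SAPIENS = {"alpha": 0.0214, "eta": 0.073, "beta": 1.65e-4}

total = total_probability(HOMO_SAPIENS)
first_rank = codon_rank_probability(1, **HOMO_SAPIENS)
last_rank = codon_rank_probability(64, **HOMO_SAPIENS)
```

With these rounded inputs the total falls within a few per cent of 1, and the probability decreases from the first to the last rank as expected.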

In table 1 we list the 40 biological species (6 vertebrates, 4 plants, 3 invertebrates, 2 fungi and 25
bacteria), with a number of codons ranging between 800 000 and 20 000 000 in decreasing order (data
from GenBank release 129.0 [17]), whose codon usage has been fitted, specifying for each biological
species the value of the parameters, computed by a best-fit procedure, and the corresponding $\chi^2$.
Here and in the following, the $\chi^2$ coefficient is defined by

$$\chi^2 = \sum_i \frac{\left( y_i - y(x_i) \right)^2}{y(x_i)} \qquad (5)$$

where $x_i$ are the experimental abscissae, $y_i$ the experimental values and $y(x_i)$ the fitted ones. In
some cases, $y(x_i)$ takes vanishing or negative values for a few points, hence the $\chi^2$ is not reported.
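A minimal sketch of this goodness-of-fit coefficient is given below. It assumes the Pearson form, in which each squared residual is divided by the fitted value; this assumption is consistent with the remark that the coefficient is not reported when a fitted value vanishes or becomes negative. The function name `chi_square` is hypothetical.

```python
def chi_square(observed, fitted):
    """Sum of (y_i - y(x_i))^2 / y(x_i) over all points (assumed Pearson form).

    Returns None when any fitted value is zero or negative, mirroring the
    convention of not reporting the coefficient in that case.
    """
    if len(observed) != len(fitted):
        raise ValueError("observed and fitted must have the same length")
    if any(y_fit <= 0 for y_fit in fitted):
        return None
    return sum((y - y_fit) ** 2 / y_fit for y, y_fit in zip(observed, fitted))

perfect = chi_square([0.03, 0.02], [0.03, 0.02])
single_point = chi_square([0.04], [0.02])      # (0.02)^2 / 0.02
not_reported = chi_square([0.01], [-0.001])
```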
In figs. 1-4 we report the plot of $f(n)$ as a function of $n$ for a few biological species (Homo sapiens,
Drosophila melanogaster, Arabidopsis thaliana and Escherichia coli). The plot has been cut to
$n = 61$ to take into account the fact that in the standard code there are three Stop codons (to end
the biosynthesis process), whose function is very peculiar. For the same reason, the $\chi^2$ has been
computed by taking into account the 61 coding codons only. In table 6 we report the type of
the first twenty most used codons of the observed rank distribution $f(n)$. A similar study, for a
sample of twenty vertebrates with codon statistics larger than 100 000, reveals that, for almost all
biological species, the four most used codons are GAG, CUG, AAG and CAG. All these codons
have a G nucleotide in third position and three of them encode doublets. An analysis performed on
the chloroplast codon usage for a sample of five plants gives the same result for the rank distribution
$f(n)$, see table 2 and fig. 5 (Chloroplast Arabidopsis thaliana). We also report, in table 3, the values
of the parameters and the $\chi^2$ for a sample of nine mitochondria with codon statistics larger than
15 000. The fits for Homo sapiens and Arabidopsis thaliana are presented in figs. 6 and 7. We point
out, however, that for mitochondria the codon usage frequency distribution for several species (e.g.
Arabidopsis thaliana or Drosophila melanogaster) is ill fitted by eq. (3). This may be an indication
that mitochondria do not follow the universal law (3). Note that the mitochondrial codes have a
few differences with the eukaryotic code and vary slightly between species, see e.g. [11]. In these
cases, the $\chi^2$ has been computed over the corresponding coding codons. The value of the constant
$\gamma$ is approximately equal to 1/61 = 0.0164 or to 1/64 = 0.0156, i.e. the value of the codon usage
probability in the case of a uniform, unbiased codon distribution. Therefore the other two terms
in eq. (3) can be viewed as the effect of the bias mechanism. The appearance of the linear term is
more intriguing. Let us remark that in [12], where an exponential function is used to fit the rank
of usage in genes (not the rank of usage probability), the linear term has not been observed, as its
contribution becomes noticeable for approximately $n > 20$. Owing to the analysis of genes (with at
most a few hundred codons), the fits in that paper end before this value of the rank. It is believed
that the main causes of codon usage bias are the translational efficiency, the selection pressure and
the spontaneous mutations. From the smallness of the parameter $\beta$ in eq. (3), it is tempting to identify
the linear term as the consequence of the mutation effect and the first term in eq. (3) as the effect of
selection pressure, i.e. the interaction with the environment.

Let us try to build up a simple model to explain the universal structure of eq. (3). In a physical
quantum system with discrete energy levels, the occupation number of the i-th energy level is
proportional to $\exp(-E_i/kT)$, where $E_i$ is the energy of the level, $T$ is identified with the temperature
and $k$ is the Boltzmann constant. Inspired by this analogy, let us assume that the rank of the
probability distribution is a function of some kind of energy. In this case the codon usage probability
is, for any codon, equal to $f(n) = \exp(-a E_n)$, where $a$ is a positive constant. We assume that the
system satisfies periodic conditions in the $y$ and $z$ directions, with a very small spacing in the $z$
direction, so that the excited modes along this direction can be neglected, and that the interaction
with the environment can be modelled as the switching on of a constant uniform magnetic field in
the direction of the $z$ axis. As a consequence, the Landau levels appear, the energy levels now
being in natural units

$$E_n - E_0 = \omega \left( n + \tfrac{1}{2} \right) , \qquad (6)$$

where $\omega$ is the so-called cyclotron frequency. The wave function is

$$\Psi(x,y,z) = \Phi_n(x - l x_0)\, e^{i k_y y}\, e^{i k_z z} , \qquad (7)$$

where 

ky = — , k, = — , /gZ,/^0, (8) 

and $\Phi_n(x - l x_0)$ is the wave function of the n-th energy level of a one-dimensional harmonic oscillator
shifted by $l x_0$ ($x_0 = 2\pi/qBL_y$). There is a degeneracy due to the arbitrariness of $k_y$. A quantum
charged physical system in an excited state spontaneously falls down to a lower state, the equilibrium
configuration being the system in the fundamental state. In the case of codons, the equilibrium
condition is the configuration where they occupy different "energy" levels, the situation being the
analogue of a quantum system in the presence of an external pumping-back source. In order to
model this situation in a simple way, we assume that the equilibrium probability distribution is a
function $f(n) \equiv f(n)_{B \neq 0} = F(E_n, \gamma_n)$, where $\gamma_n$ is the "spontaneous emission coefficient" of the n-th
level, which is taken to be very small because of the assumed equilibrium condition. We can make a
first-order expansion in $\gamma_n$:

$$f(n) \equiv f(n)_{B \neq 0} = F(E_n, \gamma_n = 0) + \gamma_n \left. \frac{\partial F}{\partial \gamma_n} \right|_{\gamma_n = 0} , \qquad (9)$$

where $\partial F/\partial \gamma_n$ is negative, the occupation probability decreasing with increasing emission. We
compute $\gamma_n$ in the standard dipole approximation [14] between the n-th and the k-th levels ($k < n$).
In the case of our model, we get

$$\gamma_n \propto n\, \delta_{k,n-1} . \qquad (10)$$

So, replacing $F(E_n, \gamma_n = 0)$ by the Maxwell-Boltzmann expression, we obtain eq. (3). The values of
the parameters depend on the biological species, as the selection pressure (in the model, the value of
the charge and of the magnetic field) is different for different biological species. Even if our simple
model is able to explain the universal behaviour and the right sign of the linear term, one should
rather consider it as a toy model. However, the fact that one can build a simple mathematical
scenario, able to reproduce the observed distribution, provides us with an indication of the existence
of some strong physico-chemical constraints that add to the random effects.

Since it is well known that the GC content plays a strong role in the evolutionary process, we
expect the parameters to depend on the total GC content of the gene regions (here the total exonic
GC content), which is indeed correlated with the evolution of the system (see [15] and references
therein). We have investigated this dependence and report, in table 4, the fits of $\alpha$ and $\beta$ to the total
exonic GC content $Y_{GC}$ of the biological species. One finds that the values of $\alpha$ and $\beta$ are well fitted
by polynomial functions (with $0 < Y_{GC} < 100$ in percentage):

$$\alpha = 0.21145 - 0.00776\, Y_{GC} + 7.92 \times 10^{-5}\, Y_{GC}^2 , \qquad \chi^2 = 0.0262 , \qquad (11)$$

$$10^2 \beta = 0.10096 - 0.00345\, Y_{GC} + 3.50 \times 10^{-5}\, Y_{GC}^2 , \qquad \chi^2 = 0.0170 . \qquad (12)$$

The two parameters $\alpha$ and $\beta$ appear to be correlated. Indeed the plot representing $\beta$ as a function
of $\alpha$ is satisfactorily fitted by a regression line (see fig. 8):

$$10^2 \beta = 0.00851 + 0.375\, \alpha , \qquad \chi^2 = 0.0218 . \qquad (13)$$
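The polynomial fits can be evaluated directly and compared with the tabulated parameters. The sketch below assumes that the quadratic coefficients carry a factor $10^{-5}$ and that the left-hand sides of the $\beta$ fits are $10^2 \beta$; these powers of ten are partly illegible in the source and are chosen here because they reproduce the Table 1 values. The function names are hypothetical.

```python
def alpha_fit(y_gc):
    """Parabolic fit of alpha versus the total exonic GC content (in per cent)."""
    return 0.21145 - 0.00776 * y_gc + 7.92e-5 * y_gc ** 2

def beta_fit(y_gc):
    """Parabolic fit of beta; the fitted quantity is assumed to be 10^2 * beta."""
    return (0.10096 - 0.00345 * y_gc + 3.50e-5 * y_gc ** 2) / 100.0

def beta_from_alpha(alpha):
    """Regression line relating beta to alpha (again assumed to give 10^2 * beta)."""
    return (0.00851 + 0.375 * alpha) / 100.0

# Homo sapiens in Table 1: GC content 52.58 %, alpha = 0.0214, 10^4 * beta = 1.65.
alpha_homo = alpha_fit(52.58)
beta_homo = beta_fit(52.58)
beta_homo_from_alpha = beta_from_alpha(0.0214)
```

For Homo sapiens the three values land close to the tabulated $\alpha = 0.0214$ and $\beta = 1.65 \times 10^{-4}$, which supports the assumed powers of ten.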

The value of the $\eta$ parameter is largely uncorrelated with the total exonic GC content. Let us recall
that, however, $\eta$ is a function of $\alpha$ and $\beta$ due to the normalization condition of eq. (4). Indeed we
have (assuming $e^{-64\eta} \approx 0$)^1

$$1 = \frac{\alpha}{e^{\eta} - 1} - 2080\, \beta + 64\, \gamma . \qquad (14)$$
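The normalization condition can be inverted in closed form for $\eta$. The sketch below assumes the form $1 = \alpha/(e^{\eta} - 1) - 2080\beta + 64\gamma$, which follows from summing the exponential, linear and constant terms over 64 ranks when $e^{-64\eta}$ is neglected; the function name is hypothetical.

```python
import math

def eta_from_normalization(alpha, beta, gamma=0.016, n_codons=64):
    """Solve 1 = alpha / (exp(eta) - 1) - S * beta + n_codons * gamma for eta,

    where S = 1 + 2 + ... + n_codons (2080 for 64 codons).
    """
    rank_sum = n_codons * (n_codons + 1) // 2
    denominator = 1.0 + rank_sum * beta - n_codons * gamma
    if denominator <= 0:
        raise ValueError("no positive solution for eta with these parameters")
    return math.log(1.0 + alpha / denominator)

# Homo sapiens values from Table 1 (alpha = 0.0214, 10^4 * beta = 1.65).
eta_homo = eta_from_normalization(alpha=0.0214, beta=1.65e-4)
residual = 0.0214 / (math.exp(eta_homo) - 1.0) - 2080 * 1.65e-4 + 64 * 0.016 - 1.0
```

With the rounded inputs this gives $\eta$ of roughly 0.065, to be compared with the fitted 0.073; the result is sensitive to the rounding of $\gamma$, since $64\gamma$ is close to 1.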

Using the fits for $\alpha$ and $\beta$, we can write the probability distribution function for any biological
species, whose total GC content in per cent in the exonic regions is $Y_{GC}$, as

$$f(n) = (\alpha_0 + \alpha_1 Y_{GC} + \alpha_2 Y_{GC}^2)\, e^{-\eta n} - n\, (\beta_0 + \beta_1 Y_{GC} + \beta_2 Y_{GC}^2) + \gamma , \qquad (15)$$

where $\eta$ is obtained by solving eq. (14). Of course we are not able to predict which codon occupies the
n-th rank. Finally, let us remark that the total exonic GC content $Y_{GC}$ (expressed here as a fraction)
has to satisfy the consistency condition

$$Y_{GC} = \frac{1}{3} \sum_{i \in I} d_i\, f(i) , \qquad (16)$$

where the sum is over the set $I$ of integers to which the 56 codons containing G and/or C nucleotides
belong and $d_i$ is the multiplicity of these nucleotides inside the i-th codon.
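The consistency condition can be illustrated by computing the GC content directly from a codon usage table: each codon contributes its usage probability weighted by the number of G or C nucleotides it contains, divided by 3. The sketch below uses hypothetical names and a uniform codon usage as a sanity check; it also confirms that 56 of the 64 codons contain at least one G or C.

```python
from itertools import product

BASES = "ACGU"
ALL_CODONS = ["".join(triplet) for triplet in product(BASES, repeat=3)]

def gc_multiplicity(codon):
    """Number of G or C nucleotides in the codon (the multiplicity d_i)."""
    return sum(base in "GC" for base in codon)

def gc_content(codon_usage):
    """GC fraction implied by a mapping codon -> usage probability (or count)."""
    total = sum(codon_usage.values())
    weighted = sum(p * gc_multiplicity(codon) for codon, p in codon_usage.items())
    return weighted / (3 * total)

codons_with_gc = [codon for codon in ALL_CODONS if gc_multiplicity(codon) > 0]
uniform_usage = {codon: 1 / 64 for codon in ALL_CODONS}
uniform_gc = gc_content(uniform_usage)
```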

^1 Note that the result is almost unchanged if the data are normalized on the 61 coding codons.



3 Amino-acids rank distribution 



It is natural to wonder if some kind of universality is also present in the rank distribution of
amino-acids. From the available data for codon usage, we can immediately compute (using the eukaryotic
code) the frequency of appearance of any amino-acid $F(n)$ ($1 \le n \le 20$) in the whole set of coding
sequences. The calculated values as a function of the rank are satisfactorily fitted by a straight line,

$$F(n) = F_0 - B\, n . \qquad (17)$$
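A straight-line fit of this kind can be obtained with ordinary least squares. The sketch below is a generic implementation with hypothetical names, not the fitting procedure used for the tables; it is checked on synthetic values lying exactly on a line with $F_0 = 0.087$ and $B = 0.0036$.

```python
def fit_line(values):
    """Least-squares fit of values[n - 1] to F0 - B * n for n = 1..len(values).

    Returns the pair (F0, B).
    """
    ranks = range(1, len(values) + 1)
    n_mean = sum(ranks) / len(values)
    v_mean = sum(values) / len(values)
    slope = (sum((n - n_mean) * (v - v_mean) for n, v in zip(ranks, values))
             / sum((n - n_mean) ** 2 for n in ranks))
    intercept = v_mean - slope * n_mean
    return intercept, -slope

# Synthetic check: 20 points lying exactly on F(n) = 0.087 - 0.0036 n.
exact_line = [0.087 - 0.0036 * n for n in range(1, 21)]
f0, b = fit_line(exact_line)
```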

The parameters $F_0$, $B$ and the corresponding $\chi^2$ for the fits are reported in the accompanying tables.
A better fit can be obtained in general by using a third-degree polynomial; however, the range of the four
parameters for this fit is larger than the range of the 2-parameter fit. For a few biological species, we give
below the parameters for the two fits; the plots of the linear fits for a few biological species are also given
in the accompanying tables. Note that the 21st point is just the contribution of the Stop
codons, which of course has not been taken into account for the fits. One can remark that the most
frequent amino-acid is always above the line. This can be easily understood in the light of eq. (3).
Indeed, the most frequent amino-acids get, in general, a contribution of the exponential term of eq. (3)
with a low value of $n$.



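| Species | Fit | Fitted function | $\chi^2$ |
|---|---|---|---|
| Homo sapiens | lin. | $f = 0.087 - 0.0036\, n$ | 0.0072 |
| | cub. | $f = 0.099 - 0.0088\, n + 57 \times 10^{-5} n^2 - 1.7 \times 10^{-5} n^3$ | 0.0055 |
| Arabidopsis thaliana | lin. | $f = 0.088 - 0.0036\, n$ | 0.0068 |
| | cub. | $f = 0.099 - 0.0090\, n + 62 \times 10^{-5} n^2 - 1.95 \times 10^{-5} n^3$ | 0.0049 |
| Drosophila melanogaster | lin. | $f = 0.087 - 0.0036\, n$ | 0.0125 |
| | cub. | $f = 0.097 - 0.0096\, n + 76 \times 10^{-5} n^2 - 2.5 \times 10^{-5} n^3$ | 0.0042 |
| Escherichia coli | lin. | $f = 0.090 - 0.0039\, n$ | 0.0115 |
| | cub. | $f = 0.112 - 0.0136\, n + 105 \times 10^{-5} n^2 - 3.1 \times 10^{-5} n^3$ | 0.0067 |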


Of course, the frequency of an amino-acid is given by the sum of the frequencies of its encoding
codons given by eq. (3). If the ranks of the encoding codons were completely random, we would not expect
their sum to take equally spaced values, as is the case for a regression line. Therefore, we
can infer, for the biological species whose amino-acid frequency is very well fitted by a line, the
existence of some functional constraints on the codon usage.
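The step from codon usage to the amino-acid rank distribution can be sketched as follows: the usage probabilities of synonymous codons are summed and the 20 totals are sorted in decreasing order, leaving out the Stop codons. The compact string below encodes the standard genetic code with the bases ordered U, C, A, G; all names are hypothetical, and the uniform usage is only a sanity check.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ...; '*' marks Stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {
    "".join(triplet): amino_acid
    for triplet, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def amino_acid_ranking(codon_usage):
    """Return [(amino_acid, frequency), ...] sorted by decreasing frequency."""
    totals = {}
    for codon, probability in codon_usage.items():
        amino_acid = GENETIC_CODE[codon]
        if amino_acid == "*":          # Stop codons are not ranked
            continue
        totals[amino_acid] = totals.get(amino_acid, 0.0) + probability
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

uniform_usage = {codon: 1 / 64 for codon in GENETIC_CODE}
ranking = amino_acid_ranking(uniform_usage)
```

With a uniform codon usage the ranking simply reflects the degeneracy of the code, the sextets (Leu, Ser, Arg) coming first and the singlets (Met, Trp) last.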

We report in table 10 the distribution of the amino-acids for the different biological species.

However, the behaviour predicted by eq. (3) fits the experimental data very well, while the shape
of the distribution of amino-acids seems more sensitive to the biological species. In fact, one can
remark on many plots of the amino-acid distributions the existence of one or two
plateaux, for which we do not have any explanation yet.

4 Consequences of probability distribution 

We now derive a few consequences of eq. (3). In the following, we denote by $y$ the local exonic GC
content (i.e. for coding sequences of genes) for a given biological species. Let us assume that the
exonic GC content of a biological species is essentially comprised in the interval $y_1 - y_0 = \Delta$ (e.g.
for Homo sapiens $y_0 = 35\%$ and $y_1 = 70\%$). We can write

$$f(n) = \frac{1}{\Delta} \int_{y_0}^{y_1} f(y,n)\, dy . \qquad (18)$$

Since the l.h.s. of the above equation has the form given by eq. (3) for any $n$ and for any biological
species, if we do not want to invoke some "fine tuning" in the integrand function $f(y,n)$, we have to
assume that

$$f(y,n) = a(y)\, e^{-\eta n} - b(y)\, n + \gamma \qquad (19)$$

with the condition

$$\alpha = \frac{1}{\Delta} \int_{y_0}^{y_1} a(y)\, dy , \qquad \beta = \frac{1}{\Delta} \int_{y_0}^{y_1} b(y)\, dy . \qquad (20)$$
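The averaging condition can be illustrated numerically: a species-level parameter is the mean of the local function over the GC interval, and for a linear local function this mean equals the value at the midpoint of the interval. The sketch below uses a purely hypothetical linear function `a_linear` (its coefficients are illustrative, not fitted) and the interval 35% to 70% quoted for Homo sapiens.

```python
def interval_mean(func, y0, y1, steps=1000):
    """Midpoint-rule approximation of (1 / (y1 - y0)) * integral of func over [y0, y1]."""
    width = (y1 - y0) / steps
    total = sum(func(y0 + (k + 0.5) * width) for k in range(steps))
    return total * width / (y1 - y0)

def a_linear(y):
    """Hypothetical linear local function a(y); the coefficients are illustrative only."""
    return 0.010 + 0.0004 * y

y0, y1 = 35.0, 70.0
alpha_mean = interval_mean(a_linear, y0, y1)
alpha_midpoint = a_linear((y0 + y1) / 2)
```

For a non-linear local function, e.g. a parabola, the mean over the interval differs from the midpoint value by a term proportional to the squared width of the interval, so the identification only holds when the interval is narrow or the function is linear.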

As a consequence, we predict that the codon usage probability is the same for any codon in any exonic
genic region with the same GC content. The form of the $a(y)$ and $b(y)$ functions is yet undetermined.
For Homo sapiens, we remark that the total exonic GC content $Y_{GC}$ is, to a very good approximation,
equal to the mean value of the interval $[y_0, y_1]$. Therefore, combining the preceding equations, we derive
that the functions $a(y)$ and $b(y)$ have to be linear functions of $y$. This theoretical derivation is
in accordance with the conclusions of Zeeberg [16], obtained by an analysis of 7357 genes. On a
quantitative level, using the numerical linear fits of Zeeberg, we find a very good agreement with our
calculations. Note that this result is not in contradiction with eq. (15), since the previous analysis
is valid for the fixed value of the exonic GC content for Homo sapiens. For bacteria, the range of
variation $\Delta$ of the local exonic GC content is very small. Therefore we expect the functions $a(y)$
and $b(y)$ to have the same shape as the functions $\alpha$ and $\beta$ given in eqs. (11) and (12). Hence the
functions $a$ and $b$ depend on the biological species.

We compute the Shannon entropy, given by

$$S = - \sum_{n} f(n) \log_2 f(n) \qquad (21)$$

for the codons of a biological species and plot it versus the total exonic GC content. The Shannon
entropy is rather well fitted by a parabola:

$$S = 2.2186 + 0.144\, Y_{GC} - 0.00146\, Y_{GC}^2 , \qquad \chi^2 = 0.0315 . \qquad (22)$$

Note that the parabola has its apex for $y \approx 0.50$, which is expected for the behaviour of the Shannon
entropy for two variables (here GC and its complementary AU).
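The entropy and the position of the apex can be checked with a few lines. The sketch below computes the Shannon entropy in bits of a probability list and the apex of the fitted parabola, taking $Y_{GC}$ in per cent; the function names are hypothetical. A uniform distribution over 64 codons gives the maximal value of 6 bits, which bounds the fitted curve from above.

```python
import math

def shannon_entropy(probabilities):
    """Shannon entropy in bits, S = -sum p * log2(p); zero entries are skipped."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def entropy_fit(y_gc):
    """Parabolic fit of the codon entropy versus exonic GC content (in per cent)."""
    return 2.2186 + 0.144 * y_gc - 0.00146 * y_gc ** 2

# Apex of the parabola c0 + c1 * Y + c2 * Y^2 is at Y = -c1 / (2 * c2).
apex_gc = 0.144 / (2 * 0.00146)
apex_entropy = entropy_fit(apex_gc)
uniform_entropy = shannon_entropy([1 / 64] * 64)
```

The apex lies slightly below 50% GC content, in agreement with the statement above, and the fitted maximum stays below the 6-bit bound.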

The same behaviour has been evidenced by analogous computations made by Zeeberg [16]
for Homo sapiens. So it seems that the entropy in the gene coding sequences and the total exonic
region entropy, as functions of the exonic GC content, show the same pattern.

In conclusion, the distribution of the experimental codon probabilities for a large total exonic
region of several biological species has been very well fitted by the law of eq. (3). The spectrum of
the distribution is universal, but the codon which occupies a fixed level depends on the biological
species. Indeed, a more detailed analysis shows that, for close biological species, e.g. vertebrates,
a fixed codon occupies almost the same position in $f(n)$, while for distant biological species the
codons occupy very different positions in the rank distribution. We have also derived that the codon
frequency for any genic region is, for fixed biological species and fixed GC content, the same. Entropy
analysis has shown that the behaviour observed in the genes with different GC content for the same
biological species is very similar to the one shown by the total exonic region with different GC content
for the different biological species.

Acknowledgements: The authors would like to thank A. Arneodo for fruitful discussions. A.S. 
is indebted to the Universite de Savoie for financial support and LAPTH for its kind hospitality. 

References 

[1] G.K. Zipf, Human behaviour and the Principle of Least Effort (Addison-Wesley Press,
Cambridge, MA, 1949).

[2] R.F. Voss, Phys. Rev. Lett. 68 (1992) 3805; W. Li and K. Kaneko, Europhys. Lett. 17 (1992)
655; C.-K. Peng, S.V. Buldyrev, S. Havlin, H.E. Stanley and A.L. Goldberger, Nature 356
(1992) 168.

[3] S. Nee, Nature 357 (1992) 450; S. Karlin and V. Brendel, Science 259 (1993) 677; D. Larhammar
and C.A. Chatzidimitriou-Dreissman, Nucl. Acids Res. 21 (1993) 450.

[4] R.N. Mantegna, S.V. Buldyrev, A.L. Goldberger, S. Havlin, C.-K. Peng, M. Simons and H.E.
Stanley, Phys. Rev. Lett. 73 (1994) 3169.

[5] A. Czirok, R.N. Mantegna, S. Havlin and H.E. Stanley, Phys. Rev. E 52 (1995) 446.

[6] R. Israeloff, M. Kagalenko and K. Chan, Phys. Rev. Lett. 76 (1996) 1976; S. Bonhoeffer, A.V.M.
Herz, M.C. Boerlijst, S. Nee, M.A. Nowak and R.M. May, Phys. Rev. Lett. 76 (1996) 1997; R.F.
Voss, Phys. Rev. Lett. 76 (1996) 1978; R.N. Mantegna, S.V. Buldyrev, A.L. Goldberger, S.
Havlin, C.-K. Peng, M. Simons and H.E. Stanley, Phys. Rev. Lett. 76 (1996) 1979.

[7] C. Furusawa and K. Kaneko, Zipf's Law in Gene Expression, physics/0209103.

[8] A. Czirok, H.E. Stanley and T. Vicsek, Phys. Rev. E 53 (1996) 6371.

[9] K. Kawamura and N. Hatano, Universality of Zipf law, cond-mat/0203455.

[10] Ph. Bronlet and A. Ausloos, Generalized (m,k) Zipf law for fractional Brownian motion-like
series with or without effect of an additional linear trend, cond-mat/0209306.

[11] L. Frappat, A. Sciarrino and P. Sorba, J. Biol. Phys. 17 (2001) 1.

[12] A. Som, S. Chattapadhyay, J. Chakrabarti and D. Bandyopadhyay, Phys. Rev. E 63 (2001)
051908.

[13] C.E. Shannon, Bell System Tech. J. 27 (1948) 623.

[14] J.J. Sakurai, Advanced Quantum Mechanics (Addison Wesley, Redwood City, CA, 1967) (Xth
edition, chap. 2).

[15] R.D. Knight, S.J. Freeland, L.F. Landweber, Genome Biol. 2 (2001) 1,
(http://genomebiology.com/2001/2/4/research/0010).

[16] B. Zeeberg, Genome Research 12 (2002) 944,
(http://www.genome.org/cgi/doi/10.1101/gr.213402).

[17] Y. Nakamura, T. Gojobori and T. Ikemura, Nucl. Acids Res. 28 (2000) 292,
http://www.kazusa.or.jp/codon/.



Table 1: Values of the best-fit parameters, eq. (3), for the sample of biological species. Types: vrt
= vertebrates (6), inv = invertebrates (3), pln = plants (4), fng = fungi (2), bct = bacteria (25).

| Type | Species | GC content in % | $\alpha$ | $\eta$ | $10^4 \beta$ | $\chi^2$ |
|---|---|---|---|---|---|---|
| vrt | Homo sapiens | 52.58 | 0.0214 | 0.073 | 1.65 | 0.0126 |
| pln | Arabidopsis thaliana | 44.55 | 0.0185 | 0.056 | 1.68 | 0.0051 |
| inv | Drosophila melanogaster | 54.03 | 0.0247 | 0.081 | 1.67 | 0.0089 |
| inv | Caenorhabditis elegans | 42.79 | 0.0216 | 0.064 | 1.79 | 0.0063 |
| vrt | Mus musculus | 52.38 | 0.0208 | 0.071 | 1.57 | 0.0112 |
| fng | Saccharomyces cerevisiae | 39.69 | 0.0246 | 0.069 | 1.91 | 0.0127 |
| bct | Escherichia coli | 50.52 | 0.0233 | 0.065 | 1.91 | 0.0112 |
| vrt | Rattus norvegicus | 52.87 | 0.0222 | 0.073 | 1.63 | 0.0083 |
| pln | Oryza sativa japonica | 55.84 | 0.0179 | 0.073 | 1.63 | 0.0211 |
| fng | Schizosaccharomyces pombe | 39.80 | 0.0255 | 0.068 | 1.98 | 0.0036 |
| bct | Bacillus subtilis | 44.32 | 0.0259 | 0.084 | 1.71 | 0.0241 |
| bct | Pseudomonas aeruginosa | 65.70 | 0.0538 | 0.107 | 2.76 | 0.0191 |
| bct | Mesorhizobium loti | 63.05 | 0.0416 | 0.093 | 2.44 | 0.0093 |
| bct | Streptomyces coelicolor A3 | 72.41 | 0.0567 | 0.098 | 3.14 | 0.0456 |
| bct | Sinorhizobium meliloti | 62.71 | 0.0359 | 0.076 | 2.54 | 0.0067 |
| bct | Nostoc sp. PCC 7120 | 42.36 | 0.0288 | 0.098 | 1.63 | 0.0140 |
| pln | Oryza sativa | 54.63 | 0.0173 | 0.062 | 1.59 | 0.0135 |
| bct | Agrobacterium tumefaciens str. C58 | 59.74 | 0.0308 | 0.067 | 2.43 | 0.0100 |
| bct | Ralstonia solanacearum | 67.57 | 0.0543 | 0.105 | 2.87 | 0.0149 |
| bct | Yersinia pestis | 48.97 | 0.0179 | 0.040 | 2.17 | 0.0066 |
| bct | Methanosarcina acetivorans str. C2A | 45.17 | 0.0228 | 0.068 | 1.81 | 0.0214 |
| bct | Vibrio cholerae | 47.35 | 0.0203 | 0.052 | 2.02 | 0.0100 |
| bct | Escherichia coli K12 | 51.83 | 0.0250 | 0.065 | 2.05 | 0.0117 |
| bct | Mycobacterium tuberculosis CDC1551 | 65.77 | 0.0401 | 0.094 | 2.35 | 0.0105 |
| bct | Mycobacterium tuberculosis H37Rv | 65.90 | 0.0414 | 0.097 | 2.29 | 0.0109 |
| bct | Bacillus halodurans | 44.32 | 0.0263 | 0.100 | 1.27 | 0.0233 |
| bct | Clostridium acetobutylicum | 31.59 | 0.0434 | 0.087 | 2.76 | - |
| bct | Caulobacter crescentus CB15 | 67.68 | 0.0570 | 0.113 | 2.86 | 0.0087 |
| vrt | Gallus gallus | 52.11 | 0.0239 | 0.095 | 1.17 | 0.0129 |
| bct | Synechocystis sp. PCC6803 | 48.56 | 0.0260 | 0.083 | 1.49 | 0.0140 |
| bct | Sulfolobus solfataricus | 36.47 | 0.0290 | 0.066 | 2.26 | 0.0099 |
| bct | Mycobacterium leprae | 59.90 | 0.0252 | 0.071 | 1.80 | 0.0065 |
| bct | Brucella melitensis | 58.25 | 0.0294 | 0.067 | 2.25 | 0.0121 |
| bct | Deinococcus radiodurans | 67.24 | 0.0481 | 0.098 | 2.76 | 0.0113 |
| vrt | Xenopus laevis | 47.33 | 0.0193 | 0.084 | 0.92 | 0.0268 |
| bct | Listeria monocytogenes | 38.39 | 0.0437 | 0.136 | 1.64 | 0.0267 |
| pln | Neurospora crassa | 56.17 | 0.0241 | 0.086 | 1.31 | 0.0166 |
| bct | Clostridium perfringens | 29.47 | 0.0510 | 0.092 | 3.11 | - |
| inv | Leishmania major | 63.36 | 0.0294 | 0.069 | 2.21 | 0.0050 |
| vrt | Bos taurus | 53.05 | 0.0240 | 0.089 | 1.27 | 0.0126 |


Table 2: Values of the best-fit parameters, eq. (3), for the sample of chloroplasts.

| Species | GC content in % | $\alpha$ | $\eta$ | $10^4 \beta$ | $\chi^2$ |
|---|---|---|---|---|---|
| Arabidopsis thaliana | 38.37 | 0.0254 | 0.067 | 1.95 | 0.0030 |
| Chaetosphaeridium globosum | 30.29 | 0.0515 | 0.110 | 2.59 | 0.0174 |
| Chlorella vulgaris | 34.63 | 0.0513 | 0.114 | 2.04 | 0.0093 |
| Cyanidium caldarium | 33.31 | 0.0379 | 0.092 | 2.24 | 0.0103 |
| Guillardia theta | 33.20 | 0.0452 | 0.103 | 2.20 | 0.0089 |



Table 3: Values of the best-fit parameters, eq. (3), for the sample of mitochondria.

| Type | Species | GC content in % | $\alpha$ | $\eta$ | $10^4 \beta$ | $\chi^2$ |
|---|---|---|---|---|---|---|
| vrt | Homo sapiens | 44.99 | 0.0414 | 0.099 | 2.31 | 0.0207 |
| pln | Arabidopsis thaliana | 44.18 | 0.0136 | 0.049 | 1.39 | 0.0589 |
| vrt | Mus musculus | 37.23 | 0.0455 | 0.104 | 2.44 | 0.0226 |
| fng | Saccharomyces cerevisiae | 24.17 | 0.0879 | 0.198 | 2.66 | 0.0611 |
| inv | Physarum polycephalum | 25.69 | 0.0624 | 0.128 | 2.70 | 0.0262 |
| pln | Pylaiella littoralis | 37.06 | 0.0336 | 0.108 | 1.72 | 0.0112 |
| pln | Neurospora crassa | 33.20 | 0.0388 | 0.101 | 2.14 | 0.0225 |
| vrt | Bos taurus | 39.73 | 0.0422 | 0.106 | 2.25 | 0.0430 |
| vrt | Sus scrofa | 40.52 | 0.0497 | 0.112 | 2.51 | 0.0372 |

Figure 1: Rank distribution of the codon usage probabilities for Homo sapiens. Circles are
experimental values, squares are fitted values.

Figure 2: Rank distribution of the codon usage probabilities for Drosophila melanogaster. Circles
are experimental values, squares are fitted values.

Figure 3: Rank distribution of the codon usage probabilities for Arabidopsis thaliana. Circles are
experimental values, squares are fitted values.

Figure 4: Rank distribution of the codon usage probabilities for Escherichia coli. Circles are
experimental values, squares are fitted values.

Figure 5: Rank distribution of the codon usage probabilities for Chloroplast Arabidopsis thaliana.
Circles are experimental values, squares are fitted values.

Figure 6: Rank distribution of the codon usage probabilities for Mitochondrial Homo sapiens.
Circles are experimental values, squares are fitted values.

Figure 7: Rank distribution of the codon usage probabilities for Mitochondrial Arabidopsis thaliana.
Circles are experimental values, squares are fitted values.




[Plots: $\alpha$ vs. the exonic GC content; $\beta$ vs. the exonic GC content]

Table 4: Fits for the $\alpha$ and $\beta$ parameters.



Table 6: Type of used codons of the observed rank distribution f{n). 



Rank 


1 


2 


.H 


4 


5 


6 


7 


8 


9 


10 


11 


12 


l.H 


14 


15 


16 


17 


18 


19 


20 


Homo sapiens 


GAG 


CUG 


CAG 


AAG 


GAA 


GUG 


GCC 


GAC 


AAA 


GGC 


AUG 


GAU 


AUC 


UUC 


CGC 


AAC 


cue 


AGG 


AGG 


GCU 


Mus musculus 


cue 


GAG 


AAG 


CAG 


GUG 


GAG 


GAA 


GCC 


AUG 


AUG 


GGC 


UUC 


GAU 


AAA 


AAC 


cue 


GCU 


AGC 


ACC 


GCC 


Rattus norvegicus 


CUG 


GyVG 


AAG 


CAG 


GUG 


GAG 


GCC 


GAA 


yVUC 


UUG 


AUG 


AAC 


GGC 


cue 


GAU 


AAA 


AGG 


AGG 


GGU 


CGC 


Gallus gallus 


GAG 


CUG 


AAG 


CAG 


GAA 


GUG 


AAA 


GAC 


GCC 


GAU 


AAC 


AUC 


AUG 


GGC 


iVGC 


UUC 


GCU 


CCG 


UAC 


ACC 


Xenopus laevis 


GAA 


GAG 


AAA 


AAG 


CAG 


GAU 


CUG 


AUG 


GAC 


AAU 


AAC 


GGA 


GUG 


GCU 


CCA 


AUU 


GCA 


UUU 


UCU 


ACA 


Bos taurus 


CUG 


GAG 


AAG 


CAG 


GUG 


GCC 


GAG 


GAA 


AUG 


GGC 


uuc 


AAC 


AUG 


AAA 


ACC 


GAU 


cue 


CGC 


UAC 


AGC 


Arabidopsis thaliana 


GAU 


GAA 


AAG 


GAG 


AAA 


ecu 


GUU 


UCU 


AUG 


CUU 


GGA 


AAU 


GGU 


UUU 


AUU 


UUG 


AAC 


UUC 


CAA 


AGA 


Oryza sativa japonica 


GAG 


GCC 


GGC 


AAG 


GAG 


GCG 


cue 


GUG 


GAU 


AUG 


UUC 


CUG 


GAA 


CAG 


GUC 


AUC 


CCG 


GCU 


AAC 


CGC 


Oryza sativa 


GAG 


AAG 


GCC 


GGC 


GAG 


GAU 


cue 


AUG 


GUG 


GCG 


uuc 


CAG 


GAA 


AUC 


GCU 


AAC 


GUC 


CUG 


GCA 


GGG 


Neurospora crassa 


GAG 


AAG 


GCC 


GAG 


GGC 


AAG 


cue 


AUG 


CAG 


GUC 


ACC 


GAU 


CCG 


UUC 


AUG 


GAA 


GCU 


UCG 


GGU 


UAC 


Drosopliila melanogaster 


GAG 


AAG 


CUG 


GAG 


GCC 


GUG 


GAU 


GGC 


AAC 


GAC 


AUG 


AUC 


UUC 


ACC 


GAA 


AAU 


AGG 


UCC 


UAC 


CGC 


Caenortiabditis elegans 


GAA 


AAA 


GAU 


AUU 


GGA 


AAU 


CAA 


AUG 


AAG 


CCA 


UUU 


UUC 


GAG 


GUU 


GCU 


CUU 


UCA 


UUG 


ACA 


GCA 


Leishmania major 


GCG 


GAG 


GCC 


CUG 


GUG 


GGC 


GAC 


CAG 


CGC 


AAG 


GCG 


AGG 


cue 


AGG 


AUG 


AAC 


UCG 


GAG 


GCA 


UAC 


Sacch. cerevisiae 


GAA 


AAA 


GAU 


AAU 


AAG 


AUU 


CAA 


UUG 


UUA 


UUU 


AAC 


GGU 


UCU 


GUU 


AGA 


GCU 


AUG 


GAC 


ACU 


GAG 


Schizosacch. pombe 


GAA 


AAA 


GAU 


AUU 


AAU 


UUU 


ucu 


GCU 


GUU 


CAA 


UUA 


CUU 


AAG 


UUG 


ACU 


UAU 


ecu 


GGU 


GAG 


AUG 


Escherichia coli 


CUG 


GAA 


AAA 


GAU 


GCG 


AUU 


CAG 


GGC 


AUG 


GGU 


GUG 


GCC 


AUC 


UUU 


ACC 


AAC 


GCA 


CCG 


AAU 


CGU 


Bacillus subtilis  AAA GAA AUU GAU UUU AUG AUG GGC GAG CUG CUU UAU AAU ACA GGA GCA AAG GCG GAA UUA
Pseudom. aeruginosa  CUG GCC GGC CGC GAG GCG AUG GAG CAG GUG UUC ACC CCG GUC CUC AAG AGG GAA AAC AUG
Mesorhizobium loti  GGC GCC CUG AUG GCG GUC GAC CGC UUG GAG CCG AAG CUC GUG ACC CAG AUG GAA UCG GAU
Streptom. coelicolor A3  GCC GGC CUG GAG GCG GAG GUC ACC CGC CUC GUG CCG CGG AUC UUC GCC GAG GAC UCG AAG
Sinorhizobium meliloti  GGC GCC GCG AUG GUC CUC CUG GAC CGC GAG UUC CCG AAG GAA AUG CAG GUG ACC ACG UCG
Nostoc sp. PCC 7120  GAA AUU GAA UUA AAA GAU AAU UUU GCU GGU GCA UUG GUU ACU UAU GUA AGA AUC GCG AUG
Agrobact. tumefaciens  GCC GGC CUG AUG GCG GAA CGC GUC UUG GAU GAC AAG CUC CCG AUG GUG CAG GAG ACC ACG
Ralstonia solanacearum  CUG GCC GGC GCG CGC GUG AUG GAC GCG CAG GAG UUC ACC GUC AAG AGG AUG AAC UCG CUC
Yersinia pestis  CUG GAU GAA AAA AUU GCC AUG GGU CAG AAU GCG GGC CAA AUC UUG GUG UUU ACC UUA GAG
Methanosarc. acetivorans  GAA AAA CUU GAU GGA AUU GGA AUG GAG UUU CUG GAC AUG AAU AAG AAG AUA GUU UAU UUC
Vibrio cholerae  GAA GAU AAA CAA AUU GCG GUG UUU CUG GGU AUG AUC GAG GGC UUG AAU GCC UUA GCU ACC
Escherichia coli K12  CUG GAA GCG AAA GAU AUU GGC CAG AUG GUG GCC AUC GGU ACC CCG UUU CGC AAC CGU GCA
Mycobact. tuber. CDC1551  GCC CUG GGC GCG GAG GUG ACC AUG GUC GCG GAG CGC CGG CAG UUC UCG AAC GGG GGU AUG
Mycobact. tuber. H37Rv  GCC GGC CUG GCG GAG GUG ACC AUG GUC GCG GAG CGC CGG UUC CAG AAC UCG CGG GGU AUG
Bacillus halodurans  GAA AUU AAA GAU UUU GAG CAA UUA AUG AUC UAU CUU GGA GUU AAG AGG AAU GCA GUG GCG
Clostridium acetobutylicum  AAA AUA AAU GAA GAU UUU UUA AUU UAU GGA AAG GUU GUA CUU GCA AUG AGA GCU ACA GGU
Caulobacter crescentus CB15  GCC CUG GGC GCG GAG CGC AUG GUC GAG ACC AAG UUC GUG CCG CAG AUG UCG AAC CCG CUC
Synechocystis sp. PCC 6803  GAA AUU GCC CAA GAU UUG AAA UUU GUG ACC UUA CCG AAU GGC GAG CUG GCU GGU AUG GAC
Sulfolobus solfataricus  AUA UUA AAA GAA AAG GAU AUU AAU UAU GAG GUA GUU UUU GGA AGA GCU GGU AUG ACU GCA
Mycobacterium leprae  GCC CUG GUG GAG GCG GGC AUG GUC ACC GAG CCG UUG GGU CGC CAG GAU GAA GCU UUC CGG
Brucella melitensis  GGC GCC CUG GAA CGC GCG AUG GAU AAG GUG CCG UUC AUG CAG CUU GAC GUC ACC CUC GAG
Deinococcus radiodurans  CUG GCC GGC GUG GCG GAG CGC ACC CAG CUC GAG CCG GAA CCG AGG GUG AUC UUC GGG GGG
Listeria monocytogenes  AAA GAA AUU GAU UUA AAU UUU CAA GCA GUU AUG ACA GGU UAU GCU GUA CUU GGA AUC CCA
Clostridium perfringens  AAA GAA UUA AUA AAU GAU GGA UUU GUU UAU AUU GCU AGA AAG GUA AUG ACU UCA GCA ACA
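
As an illustrative aside (not part of the original analysis), the rows above allow a rough check of how the most-used codons reflect GC content. The sketch below computes the unweighted fraction of G and C nucleotides among the 20 listed codons for two rows of the table. The name `gc_fraction` is hypothetical, and the quantity is only a crude proxy for the total exonic GC content discussed in the paper, since it ignores the usage frequencies of the codons.

```python
# Top-20 codons by rank, copied from two rows of the table above.
TOP_CODONS = {
    "Clostridium perfringens": (
        "AAA GAA UUA AUA AAU GAU GGA UUU GUU UAU "
        "AUU GCU AGA AAG GUA AUG ACU UCA GCA ACA"
    ),
    "Yersinia pestis": (
        "CUG GAU GAA AAA AUU GCC AUG GGU CAG AAU "
        "GCG GGC CAA AUC UUG GUG UUU ACC UUA GAG"
    ),
}


def gc_fraction(codons: str) -> float:
    """Unweighted share of G/C among all nucleotides of the listed codons."""
    nucleotides = "".join(codons.split())
    return sum(base in "GC" for base in nucleotides) / len(nucleotides)


for species, codons in TOP_CODONS.items():
    print(f"{species}: {gc_fraction(codons):.3f}")
```

For these two rows the fractions come out at about 0.27 and 0.45 respectively, so the AT-rich character of the first list is visible even without frequency weights.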



Table 7: Values of the best-fit parameters for the amino-acids. 



Species
Homo sapiens  3.8  0.089  0.0072
[name missing]  3.8  0.090  0.0068
Drosophila melanogaster  3.5  0.087  0.0125
Caenorhabditis elegans  3.3  0.084  0.0124
Mus musculus  3.7  0.088  0.0087
Saccharomyces cerevisiae  3.9  0.090  0.0121
Escherichia coli  4.0  0.091  0.0115
Rattus norvegicus  3.7  0.088  0.0084
Oryza sativa japonica  4.1  0.093  0.0057
Schizosaccharomyces pombe  3.8  0.089  0.0162
Bacillus subtilis  4.0  0.091  0.0104
Pseudomonas aeruginosa  4.9  0.101  0.0493
Mesorhizobium loti  4.7  0.100  0.0215
Streptomyces coelicolor  5.6  0.109  0.0624
Sinorhizobium meliloti  4.7  0.100  0.0188
Nostoc sp. PCC 7120  4.0  0.092  0.0174
Oryza sativa  3.9  0.091  0.0028
Agrobacterium tumefaciens str. C58  4.6  0.098  0.0144
Ralstonia solanacearum  4.7  0.101  0.0351
Yersinia pestis  4.0  0.092  0.0135
Methanosarcina acetivorans str. C2A  4.1  0.092  0.0063
Vibrio cholerae  3.9  0.091  0.0148
Escherichia coli K12  4.0  0.091  0.0154
Mycobacterium tuberculosis CDC1551  [missing]  0.105  0.01121
Mycobacterium tuberculosis H37Rv  5.3  0.106  [missing]
Bacillus halodurans  4.0  0.091  0.0100
Clostridium acetobutylicum  4.6  0.097  0.0076
Caulobacter crescentus CB15  5.1  0.104  0.0524
Gallus gallus  3.6  0.088  0.0040
Synechocystis sp. PCC 6803  4.1  0.093  0.0168
Sulfolobus solfataricus  4.4  0.096  0.0143
Mycobacterium leprae  4.9  0.101  0.0401
Brucella melitensis  4.5  0.097  0.0142
Deinococcus radiodurans  5.2  0.105  0.0679
Xenopus laevis  3.5  0.086  0.0084
Listeria monocytogenes  4.2  0.093  0.0088
Neurospora crassa  4.0  0.091  0.0042
Clostridium perfringens  4.6  0.098  0.0035
Leishmania major  4.7  0.099  0.0367
Bos taurus  3.6  0.087  0.0082



Table 8: Amino-acids rank distributions. 
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Table 10: Type of used amino-acids of the observed rank distribution. 



Rank  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Homo sapiens  Leu Ser Ala Glu Gly Val Pro Lys Arg Thr Asp Gln Ile Phe Asn Tyr His Met Cys Trp
Mus musculus  Leu Ser Ala Gly Glu Val Pro Lys Arg Thr Asp Ile Gln Phe Asn Tyr His Cys Met Trp
Rattus norvegicus  Leu Ser Ala Glu Gly Val Pro Lys Thr Arg Asp Ile Gln Phe Asn Tyr His Met Cys Trp
Gallus gallus  Leu Ser Glu Ala Gly Lys Val Pro Thr Arg Asp Ile Gln Asn Phe Tyr His Met Cys Trp
Xenopus laevis  Leu Ser Glu Lys Ala Gly Val Pro Thr Asp Arg Ile Gln Asn Phe Tyr Met His Cys Trp
Bos taurus  Leu Ser Ala Gly Glu Val Lys Pro Thr Arg Asp Ile Gln Phe Asn Tyr Cys His Met Trp
Arabidopsis thaliana  Leu Ser Val Glu Gly Ala Lys Asp Arg Ile Thr Pro Asn Phe Gln Tyr Met His Cys Trp
Oryza sativa japonica  Ala Leu Gly Ser Arg Val Glu Pro Asp Thr Lys Ile Phe Gln Asn His Tyr Met Cys Trp
Oryza sativa  Ala Leu Gly Ser Arg Val Glu Pro Asp Lys Thr Ile Phe Gln Asn Tyr His Met Cys Trp
Neurospora crassa  Ala Leu Ser Gly Glu Pro Arg Thr Val Asp Lys Ile Gln Asn Phe Tyr His Met Trp Cys
Drosophila melanogaster  Leu Ser Ala Glu Gly Val Lys Thr Arg Pro Asp Gln Ile Asn Phe Tyr His Met Cys Trp
Caenorhabditis elegans  Leu Ser Glu Lys Ala Val Ile Thr Gly Arg Asp Asn Phe Pro Gln Tyr Met His Cys Trp
Leishmania major  Ala Leu Ser Arg Val Gly Thr Pro Glu Asp Gln Lys Ile His Phe Asn Tyr Met Cys Trp
Sacch. cerevisiae  Leu Ser Lys Ile Glu Asn Thr Asp Val Ala Gly Arg Phe Pro Gln Tyr His Met Cys Trp
Schizosacch. pombe  Leu Ser Glu Lys Ala Ile Val Thr Asp Asn Gly Arg Pro Phe Gln Tyr His Met Cys Trp
Escherichia coli  Leu Ala Gly Val Ser Ile Glu Thr Arg Asp Lys Gln Pro Asn Phe Tyr Met His Trp Cys
Bacillus subtilis  Leu Ala Ile Glu Lys Gly Val Ser Thr Asp Phe Arg Asn Gln Pro Tyr Met His Trp Cys
Pseudom. aeruginosa  Leu Ala Gly Arg Val Glu Ser Asp Pro Gln Thr Ile Phe Lys Asn Tyr His Met Trp Cys
Mesorhizobium loti  Ala Leu Gly Val Arg Ser Asp Ile Glu Thr Pro Phe Lys Gln Asn Met Tyr His Trp Cys
Streptom. coelicolor A3  Ala Leu Gly Val Arg Pro Thr Asp Glu Ser Ile Phe Gln His Lys Tyr Asn Met Trp Cys
Sinorhizobium meliloti  Ala Leu Gly Val Arg Glu Ser Ile Asp Thr Pro Phe Lys Gln Asn Met Tyr His Trp Cys
Nostoc sp. PCC 7120  Leu Ala Ile Val Gly Ser Glu Thr Gln Arg Lys Asp Pro Asn Phe Tyr His Met Trp Cys
Agrobact. tumefaciens  Ala Leu Gly Val Arg Ser Glu Ile Asp Thr Pro Phe Lys Gln Asn Met Tyr His Trp Cys
Ralstonia solanacearum  Ala Leu Gly Val Arg Thr Asp Pro Ser Glu Ile Gln Phe Lys Asn Tyr His Met Trp Cys
Yersinia pestis  Leu Ala Gly Val Ser Ile Glu Thr Arg Asp Gln Lys Pro Asn Phe Tyr Met His Trp Cys
Methanosarc. acetivorans  Leu Glu Ile Gly Ser Ala Val Lys Thr Asp Arg Asn Phe Pro Tyr Gln Met His Cys Trp
Vibrio cholerae  Leu Ala Val Gly Ser Ile Glu Thr Asp Gln Lys Arg Phe Asn Pro Tyr Met His Trp Cys
Escherichia coli K12  Leu Ala Gly Val Ile Ser Glu Arg Thr Asp Gln Pro Lys Asn Phe Tyr Met His Trp Cys
Mycobact. tuber. CDC1551  Ala Leu Gly Val Arg Thr Pro Asp Ser Glu Ile Gln Phe Asn His Tyr Lys Met Trp Cys
Mycobact. tuber. H37Rv  Ala Gly Leu Val Arg Thr Asp Pro Ser Glu Ile Gln Phe Asn His Tyr Lys Met Trp Cys
Bacillus halodurans  Leu Glu Val Ala Gly Ile Lys Ser Thr Asp Arg Phe Gln Pro Asn Tyr Met His Trp Cys
Clostridium acetobutylicum  Ile Lys Leu Ser Glu Val Asn Gly Ala Asp Thr Phe Tyr Arg Pro Met Gln His Cys Trp
Caulobacter crescentus CB15  Ala Leu Gly Val Arg Asp Pro Glu Thr Ser Ile Phe Lys Gln Asn Met Tyr His Trp Cys
Synechocystis sp. PCC 6803  Leu Ala Gly Val Ile Glu Ser Gln Thr Pro Arg Asp Lys Asn Phe Tyr Met His Trp Cys
Sulfolobus solfataricus  Leu Ile Lys Val Glu Ser Gly Ala Asn Tyr Arg Thr Asp Phe Pro Gln Met His Trp Cys
Mycobacterium leprae  Ala Leu Val Gly Arg Thr Ser Asp Pro Glu Ile Gln Phe Lys Asn His Tyr Met Trp Cys
Brucella melitensis  Ala Leu Gly Val Arg Ile Glu Ser Asp Thr Pro Lys Phe Gln Asn Met Tyr His Trp Cys
Deinococcus radiodurans  Ala Leu Gly Val Arg Pro Thr Glu Ser Asp Gln Ile Phe Lys Asn Tyr His Met Trp Cys
Listeria monocytogenes  Leu Ile Ala Glu Lys Val Gly Thr Ser Asp Asn Phe Arg Pro Gln Tyr Met His Trp Cys
Clostridium perfringens  Ile Leu Lys Glu Gly Val Asn Ser Asp Ala Thr Phe Tyr Arg Pro Met Gln His Cys Trp
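
To make the similarity between rows of this table quantitative, one can compare two rank orderings directly. The sketch below is an illustration only, not a computation from the paper: it uses the Spearman footrule (the sum of absolute rank differences), the function name `footrule` is hypothetical, and the three orderings are copied from the table using the standard three-letter abbreviations.

```python
# Amino-acid orderings (rank 1 to 20) copied from the table above.
HOMO = ("Leu Ser Ala Glu Gly Val Pro Lys Arg Thr "
        "Asp Gln Ile Phe Asn Tyr His Met Cys Trp").split()
MUS = ("Leu Ser Ala Gly Glu Val Pro Lys Arg Thr "
       "Asp Ile Gln Phe Asn Tyr His Cys Met Trp").split()
ECOLI = ("Leu Ala Gly Val Ser Ile Glu Thr Arg Asp "
         "Lys Gln Pro Asn Phe Tyr Met His Trp Cys").split()


def footrule(order_a, order_b):
    """Sum over amino acids of |rank in order_a - rank in order_b|."""
    if sorted(order_a) != sorted(order_b):
        raise ValueError("both orderings must contain the same amino acids")
    rank_b = {aa: i for i, aa in enumerate(order_b)}
    return sum(abs(i - rank_b[aa]) for i, aa in enumerate(order_a))


print("Homo sapiens vs Mus musculus:", footrule(HOMO, MUS))
print("Homo sapiens vs Escherichia coli:", footrule(HOMO, ECOLI))
```

The two mammals differ only by three swaps of neighbouring ranks (distance 6), whereas the distance between Homo sapiens and Escherichia coli is 36; a distance of 0 would mean identical orderings.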
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